Running head: CORRECTION FOR MEASUREMENT ERROR IN LME 


Correction for Item Response Theory Latent Trait Measurement Error in Linear Mixed Effects 
Models 
Chun Wang!, Gongjun Xu’, Xue Zhang? 


1. University of Washington 
2. University of Michigan 
3. Northeast Normal University 


Correspondence concerning this manuscript should be addressed to Chun Wang at: 


312E Miller Hall 
Measurement & Statistics 
College of Education 
Box 353600 
Seattle, WA 98195-3600 
e-mail: wang4066@uw.edu 
phone: 206-616-6306 


Acknowledgement: This research was supported by the Institute of Education Sciences, U.S. 
Department of Education, through Grant R305D170042 (or R305D 160010) awarded to the 
University of Washington (University of Minnesota originally). The opinions expressed are those 
of the authors and do not represent views of the Institute or the U.S. Department of Education. 


Citation: Wang, C., Xu, G., & Zhang, X. (2019). Correction for item response theory latent trait 
measurement error in linear mixed effects models. Psychometrika, 84, 673-700. 


Correction for Item Response Theory Latent Trait Measurement Error in Linear Mixed Effects 
Models 


Abstract 


When latent variables are used as outcomes in regression analysis, a common approach that is 
used to solve the ignored measurement error issue is to take a multilevel perspective on item 
response modeling (IRT). Although recent computational advancement allow efficient and 
accurate estimation of multilevel IRT models, we argue that a two-stage divide-and-conquer 
strategy still has its unique advantages. Within the two-stage framework, three methods that take 
into account heteroscedastic measurement errors of the dependent variable in stage II analysis 
are introduced, they are the closed-form marginal MLE (MMLE), the Expectation Maximization 
(EM) algorithm, and the moment estimation method. They are compared to the naive two-stage 
estimation and the one-stage MCMC estimation. A simulation study is conducted to compare the 
five methods in terms of model parameter recovery and their standard error estimation. The pros 
and cons of each method are also discussed to provide guidelines for practitioners. Finally, a real 
data example is given to illustrate the applications of various methods using the National 


Educational Longitudinal Survey data (NELS 88). 


It is not uncommon to have latent variables as dependent variables in regression analysis. 
For instance, the item response theory (IRT) scaled 6 score is often used as an outcome measure 
to make high-stakes decisions such as evaluating performance of individual teachers or schools. 
However, there exist potential errors in estimating the latent 0 scores (or any other latent 
variables from factor analysis perspective), and ignoring the measurement errors will adversely 
bias the subsequent statistical inferences (Fox & Glas, 2001, 2003). In particular, measurement 
error can diminish the statistical power of impact studies, yield inconsistent or biased estimates 
of model parameters (Lu, Thomas, & Zumbo, 2005), and weaken the ability of researchers to 
identify relationships among different variables affecting outcomes (Rabe-Hesketh & Skrondal, 
2004). The consequence can be especially severe when the sample size is small, the hierarchical 
structure is sparsely populated, or when the number of items is small (e.g., Zwinderman, 1991). 

When measurement error follows a normal distribution with a constant variance, 
correcting for the error can be easily handled via reliability adjustment (e.g., Bollen, 1989; 
Hsiao, Kwok, & Lai, 2018). The main challenge of having IRT @ score as dependent variable is 
that the measurement error in 6 is heteroscedastic with its variance depending on true 9. With the 
growing computational power nowadays, a recommended approach to address the measurement 
error challenge is to use an integrated multilevel IRT model (Adams et al., 1997; Fox & Glas, 
2001, 2003; Kamata, 2001; Pastor & Beretvas, 2006; Wang, Kohli, & Henn, 2016) such that all 
model parameters are estimated simultaneously. This unified one-stage approach incorporates 
the standard errors of the latent trait estimates into the total variance of the model, avoiding the 
possible bias when using the estimated 0 as the dependent variable in subsequent analysis. 

Despite the statistical appeal of the one-stage approach, we advocate that a “divide-and- 


conquer” two-stage approach has its practical advantages. In the two-stage approach, an 


appropriate measurement model is first fitted to the data, and the resulting @ scores are used in 
subsequent analysis. This idea is in the same spirit as “factor score regression” proposed decades 
ago (Skrondal & Laake, 2001). The benefit of this approach includes clearer definition of 
factors, convenience for secondary data analysis, convenience for model calibration and fit 
evaluation, and avoidance of improper solutions. Indeed, it is known that unless an adequate 
number of good indicators of each latent factor are available, improper solutions (a.k.a., 
Heywood cases, negative variance estimates) can occur. Anderson and Gerbing (1984) found 
that with correct models, their simulation showed 24.9% of replications had improper solutions. 
With improper solutions, test statistics no longer have their assumed distributions, and 
consequently statistical inference and model evaluation become difficult (e.g., Stoel, Garre, 
Dolan, & van den Wittenboer, 2006). 

Moreover, it has been known that partial misspecification in a model causes large bias in 
the estimates of other free parameters in structural equation modeling (SEM). In the presence of 
misspecification, a one-step approach will suffer from interpretational confounding (Burt, 1973, 
1976), which refers to the inconsistency between the empirical meaning assigned to an 
unobserved construct and the a priori meaning of the construct. The potential for interpretation 
confounding is minimized when the two-step estimation approach is employed (Anderson & 
Gerbing, 1988). Furthermore, the specification errors in particular parts of an integrated model 
can be isolated by using the separate estimation approach. 

Another compelling argument in support of two-stage estimation is the convenience for 
secondary data analysis. In a large-scale survey such as NAEP or NELS88, usually hundreds of 
test items and educational, demographic, and attitudinal variables are included, such that droves 


of descriptive statistics, multiple regression analyses, and SEM models might be entertained. In 


this case, neither carrying out all of these analyses nor providing sufficient statistics for them is 
feasible. Oftentimes, these survey data provide either item parameters, or estimated 0’s along 
with their standard errors. Hence, the methods introduced in this paper will come in handy to 
handle secondary data analysis with limited available information. 

In this paper, we investigate different methods of addressing the measurement error 
challenge within a two-stage framework. These methods will be compared to the naive two-stage 
method and an integrative one-stage Markov chain Monte Carlo (MCMC) method (Fox & Glas, 
2001, 2003; Wang & Nydick, 2015) in a simulation study. We intend to show that the proposed 
two stage methods outperform the naive method and they produce comparable results to the 


MCMC method. 


1. Literature Review 


With the advent and popularity of Item Response Theory (IRT), the IRT-based scaled 
scores (i.e., 9) has been widely used as an indicator of different latent traits, such as academic 
achievement in education. Hence, 0 is treated as a dependent variable in various statistical 
analysis, including simple descriptive statistics (Fan, Chen, & Matsumoto, 1997), two sample t- 
test (Jeynes,1999), multiple regression (Goldhaber & Brewer, 1997; Nussbaum, Hamilton, & 
Snow, 1997), analysis of variance (ANOVA, Cohen, Bottge, & Wells, 2001), linear mixed 
models (Hill, Rowan, & Ball, 2005), hierarchical linear modeling (Bacharach, Baumeister, & 
Furr, 2003), and latent growth curve modeling (Fraine, Damme, & Onghena, 2007). In all these 
cited studies, 0 scores were first obtained from separate IRT model fitting, and then they were 
used as variables in different statistical models as if they were “true” values without 
measurement errors. Complications arise, however, if the latent 0 scores were estimated with 


non-ignorable measurement errors. 


If a linear test when a fixed number of items is given to students, the resulting 
measurement error (or standard error, SE) typically follows a bowl shape with SE being smaller 
when the true latent trait is in the middle (e.g., Kolen, Hanson, & Brennan, 1992; Wang, 2015) of 
the 6 scale. When an adaptive test is given to students, the resulting SE is more of a uniform 
shape (e.g., Thompson & Weiss, 2011; van der Linden & Glas, 2010). The differential SE, 
depending on the true 8 level and test mode, complicates the treatment of measurement error 
issue in the subsequent statistical analysis. 

There are quite a few studies that have accounted for the measurement errors in 6 
assuming a constant measurement error term. In other words, simple measurement error models 
precipitate corrections to estimate “true” variances and correlations from their “observed” 
counterparts. For instance, Hong and Yu (2007) analyzed the Early Childhood Longitudinal 


Study Kindergarten Cohort (ECLS-K) data using a multivariate hierarchical model to study the 


relationship between early-grade retention and children’s reading and math learning. Let Yi, 


denote child i’s T-score 'in school j in Year ¢, then the level-1 model in their analysis is 


generically expressed as 


Y,, =T), +e 


2 
7] uj uj? Cu Zz NO, oO; ). (1) 
The test reliability was then used to compute the error variance o in each year. Although 


correctly accounting for measurement error improves the estimation precision, this treatment 
overlooks the fact that the measurement error of IRT @ scores is not constant across the @ scale. 
A Statistically sound approach that follows through the assumption of IRT is to let 

&, ~ NO, o;.), however, the relaxation of the common variance assumption in Equation (1) 


ij 


' The T-score is a standardized score, which was in fact a transformation of an IRT @ score. 


imposes computational complexity to the model. The objective of this paper, therefore, is to 
investigate methods for addressing challenging measurement error issues in the two-stage 
approach. We need to acknowledge that this paper only focuses on the measurement errors 
occurred on the dependent variables, whereas there is extensive literature on dealing with 
measurement errors in covariates (i.e., independent variables). Methods for the latter scenario 
may include the method-of-moment (Carroll, et al., 2006; Fuller, 2006), simulation-extrapolation 
(Carroll et al., 2006; Devanarayan & Stefanski, 2002), and latent regression (Bianconcini & 
Cagnone, 2012; Bollen, 1989; Skrondal & Rabe-Hesketh, 2004). For a comparison of methods, 
please refer to Lockwood and McCaffrey (2014). 

The rest of the paper is organized as follows. First, we will introduce the multilevel 
model that is considered throughout the study. In other applications, both the measurement 
model and the structural model can take other forms as long as the latter is a linear mixed effects 
model, and all methods introduced in the paper still apply. Second, four different methods are 
introduced within the two-stage framework, including a naive method. Then, a simulation study 
is designed to evaluate and compare the performance of different methods, followed by a real 
data example. A discussion is given in the end that summarizes the pros and cons of each 


method. 


2. Models 


The model is comprised of two main levels, the measurement model and structural 
model. In this paper, we will focus specifically on the linear mixed effects model (LME) as the 
structural model in stage II inference. In particular, we will base the discussion on the scenario of 
longitudinal assessment, i.e., modeling individual and group level growth trajectories of student 


latent abilities over time via the latent growth curve model (LGC). Because the LGC model 


belongs to the family of LME models, the methods introduced in this paper can be easily applied 
in all specific types of LME models for different nested structures. 

At the measurement model level, the three-parameter logistic (3PL) model (Baker & 
Kim, 2004) is considered. The probability for a correct response yj;;z at time ¢ (¢ = 1,..., 7) for 
item j (j = 1,..., J) and person i (i = 1,..., V) can be written as 


1 
i exp | -D(a,9, -b,)| 


(2) 


PVs, = G54 jp>P ino) = Ci, (= ¢,) 


where D is a scaling constant that usually set to be 1.7. aj, bj-, and cj, are the discrimination, 
difficulty, and pseudo-guessing parameter of item / at time ¢ , and 6;, is the ability of person i at 
time ¢. In longitudinal assessment, although the item parameters could differ across time (i.e., 
the subscript ¢ is embedded for item parameters in (2)), anchor items need to be in place to link 
the scale across years (e.g., Wang, Kohli, Henn, 2016). 
In the structural model level, we have a LME model with 0; as dependent variables 

written as follows 

6, =X;B + Zu; +e; . (3) 
Considering the LGC model as a special case of (3), if assuming a unidimensional 6; is measured 
per time point, then both 6; and e; are 7-by-1 vectors. X; and Z are the 7-by-p and 7-by-q design 
matrices, and # and wu; are p-by-1 and q-by-1 vectors denoting fixed and random effects 
respectively. 7 is the total number of time points. In a more general case, Z can also differ across 


individuals (Z;). 


For the rest of the paper, we consider a simplest linear growth pattern, i.e., X = Z = 
1 0 
. But the methods discussed can be easily generalized to the conditions when X and Z 

1 T-1 
differ. For instance, if one is interested in the treatment effect, and let g; denote the observed 
covariate of treatment, with g; = 1 indicating person i belongs to the treatment group, and 0 
otherwise. Then X; is updated as X; = (Z, 9; X 11x4) whereas Z stays the same. Similarly, if 
one is interested in the treatment by time interaction, then X; = (Z, g; X [0,1,..., 7 — 1]") where 
the superscript “t” denotes the transpose throughout the paper. 

The random effects, u;, are typically assumed to follow multivariate normal 


distribution, 
w= (yi, ~MYN (= (9) -Su= [ey tral) 
and for simplicity, we assume an independent error structure, i.e., €, ~ N(0,0°). 
If a multivariate latent trait (i.e., D dimensions) is measured at each time point, let 0; = 

[Oita +» Dit 7) +» Vint) «+» Oipr]* with the first T elements refer to the latent trait at dimension 1 
across T time points, Equation (3) still holds. But X becomes a (DxT) -by-(Dx2) matrix taking 

1 0 
the form of Ip® , where Ip is an identity matrix of size D-by-D, and ® is the 

1 rai 
kronecker product. B = (Bo1,-»» Bop» Biz» +» Bip)’ and uj = (Ujoy, «» Uiop, Wirt» Ui1p)* both 


become (Dx2) -by-1 vectors of fixed and random effects respectively. 


3. Model Estimation 
3.1 Unified One-stage Estimation 


To estimate the multilevel IRT model simultaneously, the current available estimation 
methods include, but are not limited to, the generalized linear and nonlinear methodologies 
described in De Boeck and Wilson (2004), the generalized linear latent and mixed model 
framework of Skrondal and Rabe-Hesketh (2004), Bayesian methodology of Lee and Song 
(2003) including the Gibbs sampler and Markov chain Monte Carlo (MCMC, Fox & Glas, 2001, 
2003; Fox, 2010). These methods are suitable for a general family of models allowing 
linear/nonlinear relations among normal latent variables and a variety of indicator types (e.g., 
ordinal, binary). 

Among them, the first two approaches require numerical integration and calculation of 
the likelihood, which becomes computationally prohibitive or even impossible when the model is 
complex or the number of variables is large. Rabe-Hesketh and Skrondal (2008) admitted that 
“estimation can be quite slow, especially if there are several random effects”. The Bayesian 
approach requires careful selection of prior distributions for each parameter, which might not 
come naturally for researchers who are unfamiliar with Bayesian methods. Other methods that 
supposedly alleviate the high-dimensional challenge (von Davier & Sinharay, 2007) include: 
adaptive Gaussian quadrature (Pinheiro & Bates, 1995), limited-information weighted least 
squares (WLS), and graphical models approach (Rijmen, Vansteelandt, & De Boeck, 2008). All 
of these methods have proven to work well in respective studies. Even so, a divide-and-conquer 
two-stage estimation approach still has its own advantages (e.g., reasons presented at the 
beginning) and it is the main focus of this paper. Given the flexibility MCMC offers to deal with 


the 3PL model, we will use it as a comparison to the two-stage estimation methods. 
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3.2 Two-stage Estimation 


Let ¥ =(B, ¥,, 7”) denote the set of structural parameters of interest, and let 
3, = (a,b,c) denote the set of item parameters pertaining only to the measurement part of the 
integrated model (Skrondal & Kuha, 2012). Throughout this paper, we assume the item 
parameters &, =(a,b,c) are known to alleviate any propagation of errors (such as sampling 
error) from item parameter calibration. For readers who are concerned about item calibration 
errors, please refer to the method proposed in Liu and Yang (2018), namely, the Bootstrap- 
calibrated interval estimation approach. 

Within the divide-and-conquer two-stage estimation scheme, because the latent outcome 
variable @; (for person i) is measured with error, instead of observing 0;, one only observes 0; 
from stage-one IRT calibration, and 

0; = 6; + &, (4) 
where €&; is the vector of measurement errors with a mean 0 and covariance matrix, Zg,. Zg, is 
also known as the error covariance matrix, the magnitude of which depends on many factors, 
including (1) test information at 8;, which also depends on whether the test is delivered via 
linear mode or adaptive mode; and (2) IRT model data fit. In the first stage, both 0; and @; are 
estimated. Either the Maximum a Posteriori (MAP) or the Expected a Posteriori (EAP) is used 
to obtain the point estimate of 6; along with the error covariance matrix estimate, Lo, for each 
person separately. Chang and Staut (1993) have shown that when test length is sufficiently long 
and when MLE is used, ¢€; will follow normal distribution with mean 0 and variance proportional 


to the inverse of the Fisher information evaluated at true 6;, ic., 2, ~ 1-'(0,) . Their results can 


be generalized to multidimensional @’s and to MAP (e.g., Wang, 2015). Even though true 6; is 
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unknown in practice, we have Lo, = I-1(0;) by plugging in 0; instead of @;. That is, using 
I-1(@,) asa proxy to the error variance of 6; is still viable as long as 6; is close to the true 
value (e.g., Koedel, Leatherman, & Parons, 2012; Shang, 2012). Although Lockwood and 
McCaffrey (2014) argued that E [7 -1;) | is likely an overestimate of E[I~+(@;) ], and sucha 
positive bias can lead to systematic errors in measurement error correction based on test 
reliability, this bias is no longer problematic in our methods because we treat each 6; and Lo, 
individually, and we do not need a reliability estimate from F [7 -1;) | to correct for 
measurement error. 

Given the linear mixed effects model defined in Equation (3), the likelihood of both 


random and fixed effects is therefore 


[TOs XB + Zu,,0°T,) |o(u50,%,), (5) 


i=l 
where N denotes sample size and ¢(.) denotes the multivariate normal density. The likelihood in 
Equation (5) assumes that the random effect follows a multivariate normal distribution with a 
covariance matrix of },,. A non-normal distribution of the random effect is also allowed if 
needed. Maximum likelihood estimation proceeds with integrating out the random effects first, 


leading to a marginal likelihood of 
N 
[1] | @:X,8+ 2u,,0°T,) ]o(u,:0,2, du , (6) 
i=l 
which needs to be maximized to find the solution of , sapeta . Then the individual coefficient u; 


will be predicted via the best linear unbiased predictor (BLUP). 
Combining the linear mixed effects model in Equation (3) with the measurement error 


model in (4), the likelihood in Equation (5) is updated as 


12 


N n an 
Ly,6,u) =|] | oO;X,B+Zu,,0°T,) |ou,:0,2,, 9858, X,) (7) 
i=l 
in which case both random coefficient u; and latent factors 8; need to be integrated to obtain the 


marginal likelihood. In Equation (7), g(.) denotes the density of the measurement error 


distribution. Therefore, the joint log-likelihood of ®= (B, ¥.,,, 7”) based off (7) is written as 
N N 4 A 
l(y,0,u) => l(y,0,,u,) = > {log [ (9; X,B+ Zu,,0°T,) | +log p(u,;0,5,)+log 9(8;8,,.d, )} . 
i=l i=l 


(8) 
This equation will be used throughout the following explication. 

We need to emphasize that the discussion hereafter is based on the assumption that the 
measurement error follows normal or multivariate normal distribution with error covariance 
matrix Lo ,- Diakow (2013) suggested using Warm (1989)’s weighted maximum likelihood in 
stage I along with a more precise version of the asymptotic standard error (Magis & Raiche, 
2012). As the paper unfolds below, the non-normal measurement error distribution is also 
allowed in the method described in section 3.2.3. In fact, both methods provided in sections 3.2.2 
and 3.2.3 are suitable for a level-1 variance-known problem (Raudenbush & Bryk, 2002, chapter 
7), and our goal is to provide an accurate method for secondary data analysis that is convenient 


and understandable for applied research (Diakow, 2013). 


3.2.1 Method I: Marginalized MLE (MMLE) 


When both @(.) and @(.) in Equation (8) follow or can be well approximated by a normal 


distribution (or multivariate normal), it can be derived that the marginal likelihood of the 
combined model, after integrating out both random coefficient u; and latent factors 8; in (8), has 


a closed form up to a certain constant (for detailed derivations, please see Appendix A) 
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expressed below. To be specific, given the joint likelihood in Equation (7), the marginal log- 


likelihood of the target model parameters can be shown to be 


N ¥ : a 
I(y) = log L(y) © -N log|X,|-— |X,A] +d Cog|Z,,|-loglo* X,' +2,)) 
i=1 

(9) 

a = — m Z * « |[2 

ca | oy tool)? OOF 7 + o°X,B)| + a"? u 
i=] 
where 

(U2) =D 407°Z'Z-04Z'(o" De +1,)'Z, and (10) 
=o Z'(XB)- Zo? Li +1) (2 +07 XB). (11) 


In above equations, 


(| denotes the determinant of a matrix, and || I denotes an inner product of 


a vector. The closed form marginal likelihood for the longitudinal MIRT model is also presented 
in the Appendix A. 

The MMLE proceeds with maximizing the closed-form marginal log-likelihood in (9). 
The ‘optim’ function in ‘stats’ library of R is used for solving the maximization problem. This 
function provides general purpose optimization based on Nelder-Mead, quasi-Newton, and 
conjugate-gradient algorithms. It allows for user-specified box constraints on parameters. Instead 
of using the default Nelder-Mead method (Nelder & Mead, 1965) which tends to be slow, we 
choose to use “L-BGFS-B” method available in the function because our objective function in 
(9) is differentiable. In particular, BGFS is the quasi-Newton method proposed by Broyden 
(1970), Fletcher (1970), Goldfarb (1970), and Shanno (1970), which uses both function values 
and gradients to construct a surface to be optimized. L-BGFS-B is then an extension of BGFS 
(Byrd et al., 1995) that allows box constraints in which each variable is given a lower and/or 


upper bound as long as the initial values satisfy the constraints. In our application, the constraints 
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include -1000< Bo, 8,<1000, 0.001<o,,,, 6,,<5, -.99<p<0.99, and 0.001<o?<10007. The initial 
values for all parameters are set at 0.1. Both the parameter point estimates and their standard 
errors are output from the function, with the former being the final estimates upon convergence, 
and the latter obtained from the Hessian matrix. In some extreme cases when Hessian matrix is 


not available, we use numeric differentiation available in the ‘numDeriv’ package instead. 


3.2.2 Method IT: Expectation-Maximization (EM) 


In this section, we will describe an alternative method to resolve the challenge of high- 
dimensional integration involved in the marginal likelihood. It is complementary to Method I 
when the closed-form marginal likelihood is not available, or when the numeric optimization 
fails to converge properly. 

In particular, when treating the random effects and latent variables, w; and 6;, as missing 
data, this method proceeds iteratively between the expectation (E) and maximization (M) steps 
until convergence. At the (m+1)th iteration, in the E-step, take the expectation of log-likelihood 


with respect to the posterior distribution of u; and 6; as 
‘ a> X< a(m 
EY |W) => [Uy.8,u)SO,u,|6,.2, .y)du,dG, , (12) 
i=] 


where f(6,,u,|6,,>, .w””) denotes the posterior distribution, and [(y,0,,u,) takes the form in 


Equation (8). The integration in (12) can be obtained easily when one samples directly from the 


posterior distribution, such that 


qz=l 


: v1 2 
Ewiv)=¥] S10) (13) 


* Originally, Z,, needs to be constrained to be non-negative definite. However, this is not a box-constraint 
that ‘optim’ function can handle. We therefore impose constraints on the variance and correlation terms. 
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where (8%, u;’) is the gth draw from the posterior distribution, and Q is the total number of Monte 


Carlo draws. This Monte Carlo based integration is appropriate even if the measurement error or 


random effects does not follow normal distributions, and hence we consider this approach more 


general than the MMLE method. 


exp 


If both the measurement error and random effects are indeed normal, then the conditional 


ectation in (12) has a closed form which can be directly computed without resorting to 


numeric integration. That is, given 0; and ya, from Stage I estimation and wy ”” from the mth 


EM cycle, the joint posterior distribution of (0;,u;) follows a multivariate normal, with a 


variance of 


3 ee 4+ prs —(620 ]J)-1Z | 


50m) = ee Sie 
—Z'(62(™)-1 (Som) -1 + Z* (620M “17 


-1 
ee 522 : (14) 


and a mean of 


ay” 
a~(m) 


By 


| 


(yt = yoy yr Ase 8: 4+ (620M 1F)-1X, BO 4+ ¥2 22) 1 ZF G20) 1-1 X BO] 
(322 _ » Yast © Deauh Imp ap) aes yA Con iy Bae. 7 2s = pom @ ua) ie O37: ae (620 1)-1X BO) 


(15) 


M-step proceeds with maximizing the conditional expectation in (12) with respect to y. 


Given the form of /(w,0,,u,) in Equation (8), £, Yi, 77 all have the closed form solution as 


follows, which greatly simplifies the maximization step, 


N 


jin) = [xix xX) B(8)' -ZEu,) |, ~ 


i=l 
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n2(m+l) ye (9, -x,p" — Zu,)' (8, =x pr - Zu,)) 
Oo m+ = l=. 


, 17 
NxT aoe 


N 
DoE” (um) 


yon = 
N 


(18) 


The notation of E“ indicates that, at the (m+1)th EM cycle, the expected values in (16)~(18) 


are obtained from the first and second moments of the posterior multivariate normal distribution 
S(O, U,; | 6, a ,w'”) with mean and variance specified in (14) and (15). Equation (17) adopts 


the expectation conditional maximization (ECM) idea in Meng and Rubin (1993) in that the 
closed-form solution for residual variance only exists conditioning on the updated parameter 
B ("*) “The ECM algorithm shares all the appealing convergence properties of EM. 

If the measurement model is the multidimensional IRT model with D dimensions, and if 
the residual error covariance matrix is still assumed to be diagonal, then the aforementioned EM 
algorithm only needs to be modified minimally. In particular, in the E-step, one simply needs to 
replace I; with Ip7, whereas Z,, and X; take the updated forms. In the M-step, at the (m+1)th 
iteration, the closed-form update for B°"+ stays exactly the same as in (16). The update for 


6?" is modified as 


YE” (@-X,B -Zu,) (6, -X,B" -Zu,)) 
ae ee a se (19) 
NxDxT 
The standard error of the parameter estimates are obtained using the supplemented EM 
algorithm (Dempster, Laird, & Rubin, 1977; Cai, 2008). The principle idea is reiterated briefly as 
follows. The large sample error covariance matrix of MLE is known to be 
Y)=1/ WIL, -AW'. (20) 


VW Y)=l'W 
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where I (w\lY ) is the Fisher information matrix based on observed response data, Y. I,(W) is the 
natural by-product of the E-step as it is simply the second derivative of Equation (12) with 
respect to all elements in ¥%. A(y) is the fraction of missing information, which can be obtained 


via numerical differentiation as 


OM (w) 
oy 


AWW) = (21) 


lag 2 


(m+1) 


where M(w) defines the vector-valued EM map as wy’ = M (y””). Upon convergence, 
y = M(w). For details regarding the calculation of A(y) in general, please refer to Cai (2008) or 
Tian, Cai, and Xin (2013). We use a direct forward difference method (i.e., Eq. 8 and 9 in Tian et 


al., 2013) with a perturbation tuning parameter 7=1. For details with respect to the specific form 


of I.(w), please see Appendix B. 


3.2.3 Method III: Moment Estimation Method 


If framing the estimation problem from a slightly different perspective, the linear mixed 
effects model in Equation (3) actually leads to the mean and covariance structure as follows, 
Ug = E(0;) = X{B ; 2g = ZU,Z° +07 lr. (22) 
It implies that to recover the structural parameters, ¥=(B, ¥,,, 07), only the Z, and S (i,¢., 


estimated population mean and covariance of @) need to be obtained in Stage I, rather than the 
individual point estimate of 8; and Xg,. This is consistent with the traditional wisdom in 
structural equation modeling (SEM), in which the inputs can be the mean and covariance matrix 
rather than the raw data. In our application, we assume 6;,’s follow multivariate normal in the 
population. When this assumption is satisfied, the mean and covariance contain all information 
(i.e., sufficient statistics), and when this assumption is violated, this method may still provide 


robust, consistent parameter estimates. 
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In stage I, the 8; and Xg, are estimated from raw response data via the EM algorithm 
(Mislevy, Beaton, Kaplan, & Sheehan, 1992). In particular, without imposing any particular 


growth pattern on latent traits over time, the full joint likelihood is 


L(,,2,|y.4,b,c) =| |p," —p,,) "98.49. Zp), (23) 


i,j.t 
where @ (-) again denotes multivariate normal density. Then in the E-step, for the (m+1)th 
cycle, the conditional expectation of (Ug, dig) is 

E(log L(uy,2,)| As”. 25") = [1(y,24 | ¥.4,b,¢) P(A,” 2)”, y)dd, (24) 
where the integral can be obtained via Monte Carlo integration by drawing Q samples of 04’s 


a”) and covariance X”?. 


from multivariate normal with mean £2, 
M-step follows with maximizing the conditional expectation in (24) with respect to 


(u,, 4,5), using the following closed-form expressions, 


ay = =o APO| Ai” .E))", y,)d0 
(25) 
Ey) = “$y (0 - jah)" (0 ft” )PO| ay” .E5”, y,)d0 
i=l 


The estimators in (26) are maximum likelihood estimates of (Ug, ig), consistent in sample size 
(i.e., N) regardless of test length (Mislevy et al., 1992). 
In stage II, ¥ can be estimated using any off-the-shelf SEM packages, using “, and -. 


as input. An example is the R package ‘lavaan’ (Rosseel, 2012), from which the MLE estimates 


of ¥ are provided. Or in essence, the generalized least squares solution to f is 
B=(X'V'X)'X'V "yw, and the MLE of §, and 6? can be found based on the likelihood 


function 
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F =-log|Z>, Z7 +671, 


where “tr” denotes the trace of a matrix. As there are no closed form solutions to (26), 


Newton-Raphson method is usually used (e.g., Lindstrom & Bates, 1988). The only input in 


(26) is -. from stage I. This method is extremely fast computationally. Because the individual 


latent score 6; is not needed in stage II estimation, the measurement error challenge vanishes. 

Please note that this moment estimation method could also apply when X; differs 
across individuals, i.e., when evaluating treatment effect is of interest. In this case, the sample 
mean of X; along with fig estimated from Stage I will be treated as the mean-structure input, 
whereas an expanded covariance matrix including py » as Well as the covariance between X; 
and @ will be put into ‘lavaan’. In this regard, Stage II estimation needs minimum update, 
whereas Stage I estimation (i.e., Equations 25 and 26) need to be updated accordingly. 

In sum, the two-stage methods introduced in sections 3.2.1 and 3.2.2 rely on the 
assumption that 0; and Mo, are asymptotically unbiased. Whereas previous methods might 
suffer from such divide-and-conquer strategy due to finite sample bias in 0; and ee the 
third moment estimation method should be fine theoretically. One limitation of the method, 
however, is that sample size needs to be large enough to enable accurate (and consistent) 
recovery of “, and pay in stage I, especially yy has to be positive definite. The MMLE and 


EM methods, on the other hand, do not seem to be affected much by small sample size. 


4. Simulation Study 


Two simulation studies were conducted to evaluate the performance of five different 


methods, they are: (1) direct maximization of the closed-form marginal likelihood (MMLE), 


-tr(Z(Z yo 27+eL,)' ). (26) 
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(2) the EM algorithm, (3) the moment estimation method, (4) the naive two-stage estimation, 
and (5) the one-stage MCMC estimation. The first simulation study focused on the 
unidimensional 3PL model as the measurement model, along with the latent growth curve 
model as the structural model; whereas the second simulation study focused on the two- 
dimensional compensatory IRT model along with the associative latent growth curve model. 
Throughout the simulation studies, all item parameters were fixed at known values to 
eliminate any potential contamination of item parameter estimation bias on the other targeted 
parameters. In addition, only dichotomous items were considered, but the 3PL and M3PL 


model could be easily replaced by the polytomous response models if needed. 


4.1 Study I 
4.1.1 Design 


The fixed and manipulated factors in the study were drawn from the previous 
literature. Two factors were manipulated: examinee sample size (200 vs. 2,000), and 
covariance matrix of the random effects (Raudenbush & Liu, 2000; Ye, 2016). The 200 
sample size is typically seen in psychology research whereas the 2,000 sample size is seen in 
education research. The medium and small covariance matrix of }!,, were set as follows 
(Raudenbush & Liu, 2000; Ye, 2016), 


0.1 0.025 


Be 0.05 
0.025 0.05 


0.05 0.1 | cnegiuin) | 


| (small). 
The number of measurement waves was fixed at 4 (Khoo, Wes, Wu, & Kwok, 2006; Ye, 2016), 
and test length was fixed at 25, which is similar to the test length for science subject in NELS 
(National Educational Longitudinal Study). 

In terms of fixed effects, the mean intercept was set at 0 (1.e., 69>=0), and mean slope was 


set at 0.15 (i.e., 6; =.15). Given the medium slope variance of .1 specified above, the mean slope 
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of .15 leads to a medium standardized effect size of .5 (see Raudenbush & Liu, 2000). Regarding 


the 3PL item parameters, a-parameters were drawn from Uniform (1.5, 2.5), b-parameters were 


drawn from Normal (0, 1) (Cai, 2010), and c-parameters were drawn from Uniform (0.1, 0.2). 


The scaling factor D was set at 1.7. Residual variance was 0f=o7 = .15 (Kohli et al., 2015). 


The details of the MCMC method including the priors are presented in the Appendix C. 


As shown in the Appendix C, conjugate priors are used whenever possible to enable direct Gibbs 


sampler. However, because we considered the logistic model throughout the paper, the 


Metropolis-Hastings algorithm is used to construct the Markov chains of certain parameters (i.e., 


A). Otherwise, when the normal ogive model is considered, the efficiency of MCMC will be 
further improved. 

In stage I estimation, a combination of maximum likelihood estimator (MLE) and 
maximum a posteriori (MAP) estimator was used. That is, MLE was considered first and if 
the absolute value of the estimate was larger than 3, then the estimation method switched to 
MAP with a normal prior N(0,5). The recovery of the structural model parameters is the 
focus of this report, including mean intercept (Bo), slope (£;,), residual variance (a7), and 
covariance matrix of random effects (>',,). For these parameters, the average bias was 


computed as the mean of all bias estimates from all replications. Taking the mean intercept 


ree 1&(B - 
parameter as an example, the relative bias and RMSE were computed as ty Fo= Po) and 


R r=l By 


eee , 
go —f,)° . Here, R denotes the total number of replications, and 2/ denotes the 
r=l 


estimate from the th replication. 50 replications were conducted per condition. In addition, 
the average estimated standard error for every parameter from each replication was 


computed, and the final mean values across replications were reported. 
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4.1.2 Results 


Table 1 presents the bias and relative bias (in the parentheses) for the structural model 
parameters. Several trends can be spotted from this table. First and consistent with our 
expectation, the naive two-stage method generated the largest bias for the residual variance o?, 
and in most cases, the largest bias for the elements in the random effects covariance matrix }\,, 
(€.g., To). However, not all elements in }\,, suffered from high bias, which might be due to the 
unsystematic measurement errors across time (1.e., the measurement error is not in an explicit 
increasing or decreasing order). Second, both MMLE and EM method tended to perform well in 
most conditions by reducing the bias of 0” and elements in ¥.,,. There is no appreciable 
difference between these two methods. Although the MMLE works with the closed-form 
marginal likelihood, hence it circumvents the numerical integration that subjects to Monte Carlo 
error, the optimization in the six dimensional space can still cause numeric error. On the other 
hand, the EM works either with Monte-Carlo based integration or closed-form integration in the 
E-step, but the closed-form solution in M-step avoids numeric optimization. Therefore, numeric 
approximations appear in different steps of these two methods, resulting in slight to no 
differences between them. Third, the moment estimation method generated the most accurate 
parameter recovery among all methods as this method does not depend on the assumption of 
normal measurement error. Hence, when the population distribution is assumed normal, this 
method is recommended. Unsurprisingly, the MCMC method also produced accurate parameter 
estimates, and in the cases when sample size is large, the best parameter estimates among all 
methods. It is only when the sample size is small and when the covariance matrix of random 
effects is small that MCMC yielded slightly higher bias in ¥.,,. This could be explained by the 
known effect of “regression toward mean” for Bayesian estimates, and such an effect will 


diminish when sample size increases. 
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In terms of the manipulated factors, the true value of the random effects covariance 


matrix did not seem to affect the results much, so did not the sample size. The parameters 


(especially Bo, B,, and a”) from the moment estimation method seemed to improve slightly with 


larger sample size, simply because the mean and covariance matrix of 9 recovered better in stage 


I with a larger sample size. The other three methods treated each individual 6; from stage I as a 


fallible estimate from its own measurement error model (i.e., Equation 5), so increasing sample 


size does not help reduce the measurement error. Overall, our observations of results are similar 


to Diakow (2013)’s conclusion where she used gllamm command (Rabe-Hesketh, Skrondal & 


Pickles, 2004) in Stata (StatCorp, 2011) with adaptive Gauss-Hermite quadrature method. 


Table 1. Bias (and relative bias) of structural model parameters for IRT+LGC model 


Covariance 


Medium 


Small 


N 


200 


2000 


200 


2000 


MCMC 


006 
006 (.041) 
008 (.056) 
015 (.074) 


-.004 (-.083) 


019 (.194) 

003 
002 (.013) 
001 (.004) 
005 (.025) 


-.000 (-.009) 


.003 (.032) 
.009 


-.003 (-.021) 


001 (.006) 
030 (.296) 


-.014 (-.564) 


.020 (.405) 
-.002 
.000 (.002) 


-.003 (-.017) 


007 (.068) 


-.003 (-.120) 


003 (.070) 


Moment 


Estimation 
-.008 


-.008 (-.051) 
-.045 (-.302) 
-.015 (-.076) 


003 (.060) 


-.017 (-.168) 


.001 


-.006 (-.037) 
-,022 (-.145) 
-.007 (-.033) 


008 (.160) 


-.022 (-.217) 


-.009 


-.002 (-.013) 
-.036 (-.237) 
-.003 (-.032) 


000 (.010) 


-.009 (-.189) 


-.007 


-.002 (-.015) 


021 (-.142) 


-.007 (-.073) 


.004 (.144) 


-.008 (-.159) 


MMLE 


018 
003 (.019) 
065 (.433) 
006 (.032) 


-.006 (-.120) 
=,021 (=913) 


027 
.005 (.034) 
.065 (.437) 
.012 (.060) 
-.005 (.096) 


-.022 (-.218) 


010 
.009 (.061) 
.050 (.334) 


-.004 (-.039) 


.003 (.124) 


-.003 (-.053) 


Ol 
008 (.058) 
047 (.312) 


-.006 (-.064) 


-.000 (.004) 
.004 (.070) 


EM 


017 
.002 (.013) 
.063 (.420) 
011 (.053) 


-.008 (-.163) 
-.029 (-.201) 


025 
003 (.023) 
065 (.432) 
014 (.069) 


-.006 (-.114) 


021 (-.213) 

009 
008 (.056) 
044 (.296) 
011 (.110) 


-.004 (-.177) 


001 (.026) 

Ol 
008 (.057) 
044 (.295) 
000 (.003) 
001 (.037) 
001 (.027) 


Naive 


017 
007 (.047) 
141 (.937) 
028 (.140) 


-.002 (-.048) 


.021 (-.209) 

029 
.008 (.054) 
142 (.944) 
.034 (.170) 


-.002 (-.030) 
-.021 (-.213) 


007 
016 (.105) 
123 (.820) 
006 (.060) 
007 (.270) 


-.001 (-.027) 


008 
015 (.099) 
119 (.794) 
003 (.033) 
008 (.313) 
001 (.028) 
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Note: The relative bias for the mean intercept (i.e., Bg) is not reported because the true value is 0. 


Ona separate note, because the accurate estimation of 6; and Lo, are pivotal to the 


success of the proposed MMLE and EM methods, Tables 2 and 3 present 6; and Lo, recovery 


results. Note that for Table 3, the bias of the measurement error estimate is computed as 


[I -1(@;)-,/1-1(0;) for person i where @; is the true value for person i. Then the average bias is 


computed across all individuals, and finally the medium value is obtained across replications. 


The medium is used instead of mean because there are a couple of outliers that may severely 


inflate the bias. As shown in Table 2, MCMC produced the smallest absolute bias and RMSE 


simply because it uses information from all time points. The estimation precision from 


MLE/MAP is also acceptable. A clear trend is that the RMSE is evidently larger at later time 


points, which is due to the way we simulated item parameters, resulting in a lack of “difficult” 


items. Regarding the recovery of the measurement error, Beis Table 3 shows that on average, 


there is about 10% bias. Therefore, it is expected that if Warm’s WLE and bias-corrected 


measurement error computation is used (Diakow, 2013, Wang, 2015), the improvement of 


MMLE and EM over naive method should be more salient. 


Table 2. Average bias and RMSE of @ estimates for the UIRT+LGC model 

RMSE 
Small covariance 

MCMC MLE/MAP 


200 6,  -.001 
6,  -.003 
6; .007 
6,  .005 
2000 6, ~—.001 
6,  .001 
é,; 000 
6, 002 


Bias 
Small covariance 
MCMC MLE/MAP 


O11 
O15 
019 
019 
005 
017 
026 
O17 


Medium covariance 
MCMC MLE/MAP 


-.001 
-.003 
.007 
O11 
-.000 
-.000 
-.000 
002 


-.001 
014 
014 

-.024 
.007 
017 
.016 

-.024 


.186 
194 
222 
.276 
187 
.196 
22d 
.280 


242 
.260 
.293 
329 
.246 
.266 
20} 
093 


Medium covariance 
MCMC MLE/MAP 


201 
.220 
.279 
370 
.200 
220 
21 
366 


265 
284 
338 
439 
.263 
294 
338 
444 
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Table 3. Bias (and relative bias) of Lo, for the UIRT+LGC model 


N 
Covariance 
( a > Oo, 


Small 


024 (.087) 
031 (.106) 
037 (.135) 
016 (.141) 


200 
Medium 
.033 (.117) 


036 (.134) 
-.010 (.126) 
-.450 (.092) 


Small 


029 (.106) 
037 (.135) 
039 (.156) 
013 (.153) 


2000 
Medium 
041 (.142) 


038 (.147) 
-.013 (.146) 
-.675 (.094) 


Table 4 presents the average standard error (SE) of all structural model parameters for 


different methods under different conditions. Consistent with our expectation, the naive method 


generated higher SE for all parameters compared to MMLE and EM methods under all 


conditions. The SEs from MCMC was also slightly high because they contained Monte Carlo 


sampling error by nature. Again, the level of covariance (1.e., X,,) did not affect the magnitude of 


SE much, and EM yielded slightly lower standard error than MMLE, but the difference is 


marginal. The moment estimation method generated slightly higher SE because it did not take 


into account all individual information in stage I but rather only used mean and covariance 


estimates, hence “limited” information. For all methods, SE dropped when sample size 


increased. 
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2000 


Table 4. Estimated standard error for structural model parameters for UIRT+LGC model 


.037 
.022 
017 
.030 
.013 
.010 
012 
.007 
.005 
.010 
.004 
.003 


Medium covariance Small covariance 
MCMC Moment MMLE EM Naive MCMC Moment MMLE EM Naive 
Estimation Estimation 

Bo 043 .036 027 .018 .045 .037 .030 022 .017 
By .030 023 016 .010 .025 024 .018 014 .010 
go? .016 .007 .010 .006 .018 014 .008 .009 .005 
Too .038 .026 023.017 043 .026 .019 .016 .008 
To1 .017 012 012 .007 .017 O11 .008 .010 .004 
T1141 .019 O11 .008 .007 .013 O11 .007 .005 .004 
Bo 013 012 .009 .006 .014 O11 .010 .007 .006 
By .009 .007 005 =.003 .008 .007 .006 .004 .003 
og? .005 .003 .003 .002 .005 .005 .003 .003 .002 
Too .012 .009 .007 .005 .014 .008 .006 .005 .002 
To1 .006 .004 .004 .002 .005 .004 .003 .003 .001 
T141 .006 .003 .003 .002 .004 .003 .002 .002 .001 


4.2 Study II 
4.2.1 Design 


In this second simulation study, the two-dimensional simple-structure IRT model was 
used. The test length was fixed at 40 at each measurement wave, hence there were 20 items 
loading on each dimension. The item parameters per domain were simulated the same as in 
Study I. The only difference is, the mean of the difficulty parameter increased over time, which 
were taken to be the average of the mean 6 from the two dimensions at the corresponding time 
point. This way, the items tend to align better with 0 as the respective time points. The number 
of measurement waves were also fixed at 4, and the fixed effects were set at 6 = 
[0,0, 0.15, 0.15]. Here the first two elements refer to the mean intercepts and the last two 
elements refer to the mean slopes. Residual variance was fixed at o? = 0.15 for simplicity. 
Given that the size of the random effects covariance matrix did not affect the results much from 


study I, we decided to fix the covariance matrix as 
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As shown above, the intercepts and slopes were uncorrelated, whereas the two intercepts 
correlated and the two slopes correlated. This simplification resulted in a reduction of the total 
number of parameters, which, to some extent, benefited the MMLE method. This is because in 
MMLE, larger number of parameters means searching in a high-dimensional space. The EM 
method was not affected, however, because of the closed-form solution in both the E-step and 
the M-step. But similar constraints were still added in the EM estimation to make a fair 
comparison. 
4.2.2 Results 

Table 5 presents the bias and relative bias (in the parenthesis) of the structural model 
parameters. First of all, as expected, the MCMC method produced the most accurate parameter 
estimates for all parameters under both conditions. Second, consistent with the findings from the 
previous simulation study, all methods produced acceptable fixed parameter estimates, and the 
bias for 69, and Boz are second smallest for the moment estimation method. This may be 
because, with slightly shorter test length (20 per dimension vs. 25 from study I), the individual 6 
and its SE may be prone to larger error, whereas the population mean and covariance estimates 
are less affected. However, the difference is not salient. The naive method again yielded the 
largest positive bias for residual variance (a7) and intercept variance (o7,, and of,,). The 
moment estimation method, on the other hand, resulted in slightly large negative bias for residual 


variance but it generated accurate slope variance estimates. In contrast, the other three methods 


resulted in slightly negative bias for slope variance, and naive method even outperformed the 
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other two by a little margin. These results match both Diakow (2013) and Verhelst (2010), who 


found that in the hierarchical linear modeling, “within-cluster variance is overestimated by the 


naive method while between-cluster variance is recovered’. 


Table 5. Bias (and relative bias) of structural model parameters for MIRT+LGC model 


N MCMC Moment MMLE EM Naive 
Estimation 
200 Bor .004 -.019 .053 .049 .033 
Boz 001 -.004 .066 .062 .055 
Bit .001 (.008) .002 (.011) .003 (.022)  -.002 (-.012) .002 (.015) 
Bi2 .000 (.001) -.003 (-.022) -.001 (-.006) -.006 (-.039) -.007 (-.045) 
o? .008 (.053) -.083 (-.550) —.011 (-.062) 011 (075) = .212 (1.413) 
Gas 019 (.095) -.021 (-.104) -.012 (-.062) -.011 (-.055) .033 (.164) 
Ougiug2  ~-010(-.103) -.008 (-.078) -.002 (-.018) — -.002 (-.016) .016 (.164) 
Giied .020 (.098) -.022(-.111) -.006 (-.033) — -.007 (-.035) .024 (.118) 
Gis 012 (.058) -.008 (-.041) -.059 (-.298) — -.059 (-.295) -.053 (-.262) 
Ousu,,  ~-901 (-.011) .001 (013) -.029 (-.285) -.028 (-.284) -.027 (-.271) 
Cees .003 (.016) -.008 (-.041) -.056 (-.278)  -.056 (-.278)  -.054 (-.272) 
2000 Bo. -.001 -.012 .063 .060 .040 
-.000 -.0088 .062 .059 .040 
A .001 (.006)  -.004(-.026) -.004(-.026) = -.008(-.055) — -.003(-.018) 
Bi2 001 (.005) — -.003(-.021) .001(.004) = -.004(-.025) — -.000(-.000) 
o* 002 (.011) = -.088(-.585) .013(.085) .013(.085) — .212(1.409) 
ee .003 (.014) -.016(-.080) —_-.004(-.022) — -.004(-.021) .027(.134) 
Ou Uo,  ~-003 (-.029) -.008(-.079) —-.003(-.030) —-.003(-.025) .014(.137) 
Ovi .002 (.010) -.016(-.083) — -.006(-.032) —_-.006(-.032) .036(.179) 
Orin .001 (.007) -.016(-.079) — -.059(-.299) — -.059(-.298) — -.056(-.278) 
Oussur> 001 (.011) -.006(-.064) = -.031(-.310) — -.031(-.309) — -.029(-.289) 
Cas .002 (.008)  -.017(-.085) — -.058(-.294) — -.058(-.293) — -.054(-.270) 


Tables 6 and 7 present the recovery of 8; and %g, respectively. In general, the MCMC 


produced more accurate 0; estimates than the MLE/MAP method unsurprisingly. The RMSE 
increases slightly at a later time also due to lack of suitable items for the certain range of 0. As to 
the recovery of the measurement error, while the relative bias is around 10% for the first three 
time points, which is similar to the results in Table 5, the relative bias drops considerably for the 


last time point and the bias itself increases dramatically. This is again because of the mismatch 
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between the item difficulties and @ at time 4. From the LGC model where true 0’s were 


simulated, the ranges of 0 are (-2.5, 2.5), (-2.5, 3), (-3, 4), and (-4, 6) for the four time points 


respectively. However, the variance of item difficulty was fixed at 1 across all time points, so 


there were not enough items with extreme difficulty levels for extreme 6’s at time 4. It is 


anticipated that both the RMSE in Table 6 and the measurement error bias will decrease if items 


with wider difficulty levels are added. 


Table 6. Average bias and RMSE of @ estimates for the MIRT+LGC model 


Bias 

N=200 N=2000 

MCMC MLE/MAP MCMC MLE/MAP 

64,  =--002 033 ~—--.000 .042 
O21 005 044 ~— -.000 039 
612 .001 046 ~— -.000 .038 
O22 .002 042 ~~ -.000 .035 
643 ---001 052 ~~ -.000 .039 
O23 .001 .046 .000 .036 
O14 .000 041 = -.000 042 
O24 .000 022 ~~ -.000 .040 


Table 7. Bias (and relative bias) of 2g, for the MIRT+LGC model 


Fo FO» 9613 F614 


N=200 .037 .026 


.023 031 


(113) (081) = (118) ~— (101) 


N=2000 .034 042 


022 025 


(108) (120)  (.105) ~—(.098) 


RMSE 
N=200 N=2000 
MCMC MLE/MAP MCMC MLE/MAP 
315 443 323 448 
326 452 318 454 
309 493 317 483 
313 .605 311 595 
367 447 371 449 
370 457 367 453 
479 480 484 A84 
484 591 478 589 
F624 Fo, O65 Fo. 
-.091 -.084 -10.643 -6.122 
(.074) (.083) (-.001)  (-.003) 
-.236 -.175 14.367 -6.822 
(.082) (.089) (-.008) (.001) 


Table 8 presented the estimated standard error for structural parameters. Overall, the 


results are consistent with the previous findings that the naive method generated somewhat larger 


standard error because “the biased estimates of the variance components might affect the 


estimated standard errors of the regression coefficients” (Diakow, 2013). 


Table 8. Estimated standard error of structural model parameter 


MCMC Moment 


N=200 


MMLE EM Naive 


MCMC Moment 


N=2000 


MMLE EM _ Naive 
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Estimation Estimation 


Bor 047 034 032.046 049-015 Ol 010 015.016 
Bor 048 033 032.046 049-015 Ol 010 015.016 
Bir —-038 032 023. 032 033.012 010 007 010.010 
Biz _—«-038 032 023 033. 033.012 010 007 010.010 
o ~—-.016 003 O11 008 018 ~— 005 001 004 003.006 
Cio, 041 023 027 014. 050913 007 009 004.016 
igiiedy © 1029 O17 019.008 035.009 006 006 003.011 
a, ——-042 023 027 014.049 013 007 009 004.016 
o.,, 031 021 013.009.022.009 O11 004 003.007 
shag: 1102 016 009 006.016 ~—_.006 Ol 003 002.005 
11%12 
o2., —_ .030 021 014.009.022.009 010 004 003.007 


5. A Real Data Illustration 


In this section, we briefly compared the performance of five methods using a real data 
example from the National Educational Longitudinal Study 88 (NELS 88). A nationally 
representative sample of approximately 24,500 students were tracked via multiple cognitive 
batteries from 8" to 12" grade (the first three studies) in years 1988, 1990, and 1992. The 
science subject data were used in this section. The sample size was 7,282 after initial data 
cleaning, and we used list-wise deletion to eliminate the effect of missing data’. The data 
contains binary responses to 25 items in each year. The true item parameters were obtained from 
NELS 88 psychometrics report (https://nces.ed.gov/pubs95/95382.pdf). The mean discrimination 
parameters were 0.85, 0.95, and 0.95 for the three measurement occasions, with the standard 
deviation of 0.29, 0.30, and 0.30 respectively. The mean and standard deviation of difficulty 
parameters were (-0.28, 0.10, 0.22) and (0.90, 0.71, 0.96) respectively. The mean and standard 


deviation of guessing parameters were (0.20, 0.19, 0.18) and (0.14, 0.13, 0.12) respectively. In 


3 We used the list-wise deletion because we wanted to create a complete data set for illustration. Our 
intention was to evaluate the performance of different methods without possible interference of missing 
data. Because we used the NELS provided item parameters and because our structural model is simple, 
the possible bias introduced by list-wise deletion may be ignored. 
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stage I analysis, the unidimensional 3PL model was considered, both the MLE estimation for 
individual ability (@@“") and the EM algorithm for population mean and covariance were 


obtained. The estimated population mean and covariance were fi = (-.43, .08, .28) and = 


0.92 0.77 0.77 . 
0.77 0.96 0.84], whereas the sample mean and covariance estimates from OMLE were 
0.77 0.84 0.96 


‘ 1.25 0.89 0.86 
fi=(-.43, .09, .28) and2 = ]0.89 1.31 0.98 
0.86 0.98 1.26 


. The two means are close, whereas the sample 


variances were larger. 

Table 5 presents the parameter estimates and their standard error (in the parenthesis) for 
the five different methods. As reflected in Table 5, the fixed effects estimates from different 
methods were close. The naive method, as expected, resulted in largest residual variance and 
intercept variance estimates. Both MMLE and EM tended to yield smaller variance estimates, 
which are consistent with the findings in Diakow (2013). This is because the random variances in 
the data can actually be decomposed as measurement error, randomness across individuals 
(random effects), and randomness within individuals (i.e., residual error). By actively 
incorporating the measurement error term in the model, the other two variances were reduced. 

Also of note is that the estimated measurement error obtained in stage 1 for extreme 61 
(i.e., close to -3 or 3) could be over 1 (in particular for measurement waves 2 and 3) due to lack 
of information in the tests for students with extreme abilities. In this case, the imprecision in the 
estimated measurement error could adversely affect the parameter estimates in the MMLE and 


EM methods (Diakow, 2013). 
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Table 5. Parameter estimates and their standard error (in the parenthesis) for NLES 88 Science 
data 


MCMC Moment Estimation MMLE EM Naive 
Bo  ~-.349(.012) -.376(.011) -.324(.001) -.339(.009) = -.371(.013) 
Bi — .351(.005) .353(.004) .350(.002) .350(.003) .354(.005) 
o .054(.003) .145(.002) .078(.001) .076(.003) .315(.005) 
To  .828(.019) .775(.015) .599(.010) .596(.011) .906(.020) 
to1 --006(.006) .004(.004) -.006(.004) .007(.0004) —-.002(.006) 
T11__ .029(.003) .014(.002) .008(.002) —.0231(.0002) .032(.004) 


6. Discussion 


In this paper, we considered three model estimation methods for (secondary) data analysis when 
the outcome variable in a linear mixed effects model is latent and therefore measured with error. 
All of them fall within the scheme of two-stage estimation that embraces the advantages of 
“divide-and-conquer” strategy. Such advantages include convenience for model calibration and 
fit evaluation, avoidance of improper solutions, and convenience of secondary data analysis. The 
last aspect is especially appealing from a practical perspective because oftentimes, the raw 
response data is considered restricted-use data and not publicly available, whereas 6 (or certain 
linear transformation of it) with its SE are publicly available. 

The three methods explored in the study overcome the limitation of the naive two-stage 
estimation that ignores the measurement errors in latent trait estimates (@) when treating them as 
dependent variables. It is known that ignoring the measurement error in 6 when 6 is treated as a 
dependent variable still yields a consistent and unbiased estimate of fixed effects (1.e., #), but the 
standard error of will be inflated, and the random effects estimates (i.e., Z,,) as well as residual 
variances will be distorted. For the MMLE and EM methods, the point estimate 6; and its 
corresponding measurement error for each student per time point are obtained in stage I 


measurement model calibration. And these two pieces of information become the key input for 
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stage II estimation. The moment estimation method, on the other hand, only needs population 
estimates of the mean and covariance matrix from stage I as input. 

To elaborate, the MMLE method builds upon the assumption of (multivariate) normal 
measurement errors such that the marginal joint likelihood of the model parameters can be 
written in a closed form. This closed form marginal likelihood is then directly maximized to 
obtain parameter estimates. Neither the known challenge of curse of dimensionality (i.e., 
numerical approximation of a high-dimensional integration) nor the lengthy sampling iterations 
is an issue any more. Comparing to MMLE, the EM method has greater flexibility because it no 
longer requires the (multivariate) normal measurement error, which may not always be satisfied 
in practice especially when there are few items per dimension. Although in this paper and in the 
simulation studies, we still assume the measurement error of 6 follows normal/multivariate 
normal just to check the feasibility of the algorithm, it can be modified to incorporate non- 
normal measurement error cases. 

The modification can be established based on the importance sampling idea. The critical 
piece to facilitate the entire importance sampling machinery is the change-of-measure sampling 
distribution, H(@;,u;). Regardless of whether or not the multivariate normality assumption is 
satisfied, H(0;,u;) can take the form of joint multivariate normal because it serves as a close 
approximation to the actual (and sometimes complicated) joint distribution of (@;, u;). 
Moreover, the random values drawn from the sampling distribution of H(@;, u;) are all 
independent, as opposed to the correlated draws from Gibbs or Metropolis-Hastings sampler in 
MCMC. The form of H(@;,u;) can be derived based on the results from Stage I, and drawing 
samples from multivariate normal distribution is very easy, hence the numerical approximation 


to the expectation in EM becomes quite straightforward. 
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The proposed MMLE and EM are based on the measurement error model that is 
essentially a random-effects meta-regression (Raudenbush & Bryk, 1985; Raudenbush & Bryk, 
2002, chapter 7), and it is in the broader framework for considering second-stage estimates in the 
presence of heteroscedasticity (Buonaccorsi, 1996). In particular, Buonaccorsi (1996) derived 
unbiased estimates of the structural model parameters (i.e., 2,,) under different specific forms of 
heteroscedasticity. Because the conditional standard error of measurement from the 3PL model is 
a nonlinear function of both item parameters and 8, Buonaccorsi’s (1996) derived results may not 
directly apply. However, the take-away message is the analytic results hold under the assumption 
of conditionally unbiased estimators and conditionally unbiased standard errors in stage I. 
Therefore, it is of paramount importance to obtain reliable 6 estimates in stage I. Diakow (2013) 
suggested using weighted maximum likelihood (WLE, Warm, 1989), and it is promising to 
check in the future for both unidimensional models and multidimensional models (Wang, 2015). 

The plausible value multiple-imputation method is another method of addressing 
measurement error issues in large-scale educational statistical inference. The statistical theory of 
this method is that, as long as the plausible values are constructed from the results of a 
comprehensive extensive marginal analysis, population characteristics can be estimated 
accurately without attempting to produce accurate point estimates for individual students 
(Sirotnik & Wellington, 1977; Mislevy, et al., 1992). Because most imputation procedures 
available in standard statistical software packages (e.g., SAS, Stata, and SPSS) assume that 
observations are independent, research on imputation strategies in the context of linear mixed 
effects models (or multilevel models) is still limited. From a theoretical perspective, using a 
multilevel model at the imputation stage is recommended to ensure congeniality between the 


imputation model and the model used by the analyst (Meng, 1994; Drechsler, 2015). Several 
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researches have demonstrated plausible values drawn from a simplified model without 
accounting for higher level dependency yielded substantial bias for random effects and negligible 
bias for fixed effects in secondary analysis (Monseur & Adams, 2009; Diakow, 2010; Drechsler, 
2015). Future research could compare the proposed methods with the plausible value approach. 

The two methods considered in the paper (MMLE and EM) account for the potentially 
non-constant error variance in the dependent variable by including a measurement error model 
with heteroskedastic variance at the lowest level of the multilevel model. We consider these two 
methods convenient and useful alternative to the well-studied multiple imputation method. One 
profound advantage of the proposed methods is that is does not require a correct conditioning 
model, which is required in the multiple imputation method. This is important because it is 
almost infeasible to find, and to sample from, a correct conditioning model that is exhaustive of 
all possible nesting structures and secondary analyses are impossible to predict. However, these 
two proposed methods do rely on the precision of 6 and its SE estimates. 

In this paper, we provide technical details for the three two-stage methods for interested 
readers to replicate and extend our study for other types of linear or nonlinear mixed effects 
models. The source code of all methods will also made available to readers upon request. On the 
other hand, the combined model (e.g., Equation 7) could potentially be fitted using off-the-shelf 
specialized software packages that can handle heteroskedastic variance at the lowest level, such 
as the gllamm command (Rabe-Hesketh, et al., 2004) in Stata (StatCorp, 2011) and HLM 
(Raudenbush, Bryk & Congdon, 2004). 

There are two limitations of the study that worth mentioning. First, the IRT item 
parameters are assumed known throughout the study. If in case the calibration sample size is 


small that the sampling error can no longer be ignored, the Bootstrap—calibrated interval 
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estimates for 8 (Liu & Yang, 2018) could be applied in stage I of the proposed two-stage 
framework. Second, while we focused only on the model parameters’ point estimates and 
standard error estimates, future studies could go one step further to evaluate the power of 
detecting significant covariates (Ye, 2015). For that purpose, the simulation design will focus on 
manipulating the effect size of the covariate (treatment effect) and the amount of measurement 


error (which could be manipulated by test length). 
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